%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Pipeline Stages in \SIM}
\label{sec:pipeline}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\SIM implements a five stage pipelined processor whose pipeline depth can 
be varied using knobs. The stages are - Fetch, Decode, Schedule and Execute (includes
Memory) and Retire. This chapter describes the basic feature of each
stage.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fetch Stage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This stage models the instruction fetch stage, its main tasks
are as follows:

\begin{enumerate}

\item Determine the thread from which instructions will be fetched next \footnote{for
    processors that are not multi-threaded, the same thread is selected
    always}, this involves checking whether a thread can actually fetch a new
	instruction or not

\item Access I-cache \footnote {when there is an I-cache miss, the front-end cannot fetch
    instructions for the selected thread}

\item Read a uop from trace (call {\textit get\_uops\_from\_traces()})

\item BTB access and branch prediction.
  \begin{itemize}
   \item Different branch predictors are implemented using the \textit{bp\_factory}
   \item Type of branch predictor to be used is set using the bp\_dir\_mech knob
  \end{itemize}

\item Send uop to queue meant for modeling the depth of the processor
  front-end.
  \begin{itemize}
   \item Depth of the front-end stage is set using one of the
	  \Verb+fetch_latency+ knobs.\footnote{There is one knob for each core
		type, small, medium, and large}
   \item Different thread fetch policies can be implemented. More details in
		Section~\ref{sec:modify:fetch}.
   \item Default fetch policy is \Verb+round-robin+ and \Verb+fetch_policy+ knob 
		will specify the policy to be used.
  \end{itemize}
  
\end{enumerate}


\ignore
		{ Different fetch polices are implemented using virtual
		  function.  {\textit fetch\_factory} function sets different {\textit
		  frontend\_c}.  Currently, a thread is selected based on the
		  round-robin policy. If the next thread is blocked to fetch, the next
		  candidate is selected based on the round-robin policy.  Different
		  branch predictor policies are also implemented and it is also set by
		  {\textit bp\_factory}. The type of branch predictor or the fetch
		  polices are set by knobs.}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reading Traces}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Reading of traces is mainly performed in \Verb+trace_read.cc+. The
main function of this step during simulation is to read a macro
instruction from the trace file and convert it into a sequence of one
or more micro-ops (uops). A simple decoding algorithm is used for
generating uops. To improve the speed of simulation, decoded uops are
stored in a hash table for reuse when the same static instruction is
encountered again during simulation. Note that to obtain dynamic
information such as addresses accessed, uops must be decoded partially
every time.

\ignore 
	  { The main task of trace read is reading an macro instruction from
	  trace file and convert it to uops.  A simple decoding algorithm is used to
	  generate micro uops.  To improve the simulation speed, decoded uops are
	  stored in a hash table.  Which trace file to open is decided in
	  process\_manager.cc }

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Decode and  Allocate Stage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\ignore
	  {
	  \todo{Decode is done in frontend.cc!}
	  \todo{according to code, frontend\_q latency = fetch\_latency + alloc\_latency}
	  \todo{according to code, alloc\_q latency = alloc\_to\_exec\_latency}
	  }

In this stage the resource requirement for each uop is calculated and space is
allocated in the required structures for the uop. At the end of this stage, each
uop occupies an entry in ROB and one of the issues queues - general purpose
(integer), floating point or memory. Depending on the knob values, there could
be a unified issue queue or distributed issue queues. For conditional branches
with BTB misses, the branch target is resolved in this stage as well.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Schedule Stage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The schedule stage implements the instruction scheduling logic. Currently
in-order, out-of-order and GPU schedulers are supported. The GPU scheduler
issues uops from the same warp/SIMD thread in order, but uops from different
warps/SIMD threads can be issued out of order in comparison to the order in
which they were dispatched. Before executing an uop the instruction scheduler
first checks for the availability of source operands, and then for the
availability of ports (functional units) for uop execution. 

\ignore
		{ The scheduler stage implements scheduling logic. Currently it supports
		in-order, out-of-order, and GPU scheduling mechanism.  First, it checks the
		source code availability and then functional unit availability-by checking
		ports.  The scheduler is implemented in a virtual function so different
		scheduling policies overload the scheduling functions.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Execution Stage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this stage, all instructions - both non-memory and memory - are executed.
The member variable \textit{m\_done\_cycle} of the uop structure indicates the
cycle in which the uop will complete. The following tasks are executed in this
stage:

\begin{itemize}

\item For non-memory uops, the latency of the the instruction is determined, an
	  execution port is marked as in use and the \textit{m\_done\_cycle} of the uop
	  is set. The latencies of uops are defined in the file
	  \textit{../def/uoplatency\_ptx.def} which gets included during the compilation
	  of the \SIM binary. 
	  
\item Branch instructions are resolved in this stage.

\item For memory uops, D-cache \footnote{or the appropriate memory structure such as
    Constant cache, Shared Memory, ...} is accessed and an execution port is
	marked as in use. If a D-cache port is available, then the port is marked as
	used, however, if a port is unavailable, the execution of the uop is attempted
	again later on. On D-cache hit, the \textit{m\_done\_cycle} of the uop is set.
	Handling of D-cache misses is explained in Section~\ref{sec:memory}.

\item Uncoalesced memory instructions result in multiple accesses to the the D-cache.

\ignore
		{ \item \todo[inline]{The number of function units can be specified via
		these knobs for the core type being used - \textit{int\_sched\_rate,
		mem\_sched\_rate, fp\_sched\_rate} - code does not use these in that
		sense}.  }

\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Retire Stage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This stage handles the in-order retirement of uops. Also, memory allocated to a uop
is freed (returned to memory pool). This stage also checks of completion of
warps, blocks, kernels ,and entire application and triggers the
repetition of applications if the knob for repetition is set. (\textit{repeat\_trace})

\ignore
		{
		\todo[inline]{At present, there are no rename and writeback stages in MacSim since
		they are not necessary to support correct and accurate simulation. MacSim
		enforces RAW dependencies, while WAR and WAW dependences can be ignored by
		assuming that once a instruction completes, its results are available for
		dependent instructions without being overwritten.}
		}

\ignore 
		  {
		  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
		  \section{Queues}
		  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

		  frontend queue
		  ia queue
		  rob
		  scheduler
		  }


% LocalWords:  doxygen frontend bp uops  GPU uop dcache uncoalesced BTB fe mem
% LocalWords:  sched fp icache ptx
